{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Probabilistic forecasting and multi-layer perceptron.\n",
    "\n",
    "This is the second part of our introduction to forecasting. \n",
    "\n",
    "This notebook starts with a short introduction to __maximum likelihood estimation__ and to __general linear models__. These concepts are used to build a model to generate a __probabilistic forecast of count data__ using a Poisson distribution. At this point you might want to take a quick look at the crash course on probability and statistics in chapter 1. \n",
    "\n",
    "The second part of this notebook is dedicated building a __multi-layer perceptron (MLP) for forecasting__, using the same data and Poisson distribution as in the first part of this notebook. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import mxnet as mx\n",
    "import numpy as np\n",
    "from mxnet import nd, autograd, ndarray\n",
    "from mxnet import gluon\n",
    "from mxnet.gluon.loss import Loss\n",
    "from math import factorial\n",
    "import pandas as pd\n",
    "%matplotlib inline\n",
    "from matplotlib import pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Maximum likelihood estimation \n",
    "\n",
    "In the previous notebook we used linear models of the form $Y_t = X_t\\beta + \\epsilon_t$. The parameters $\\beta$ are estimated by minimizing the mean squared error \n",
    "$$\\frac{1}{T}\\sum_t \\hat\\epsilon_t^2 = \\frac{1}{T}\\sum_t (Y_t - X_t\\hat\\beta)^2.$$\n",
    "\n",
    "#### The model as a distribution\n",
    "\n",
    "By making an extra assumption we can view this model and loss function in a more principled way. Let's assume that __the innovations $\\epsilon_t$ are independently drawn from a Gaussian distribution__: $\\epsilon_t \\sim \\mathcal{N}(0,\\sigma^2)$. With this extra assumption we can re-write the model as a condition distribution, which is a didtribution that depends on variable $X$ that are determined outside the model. \n",
    "\n",
    "Since we assume the errors $\\epsilon$ are Gaussian: \n",
    "$$y_t = X_t\\beta + \\epsilon_t \\implies Y_t \\sim \\mathcal{N}(X_t\\beta,\\sigma_2),$$\n",
    "meaning that $y_t$ __follows a Gaussian distribution__ with mean $X_t\\beta$ and variance $\\sigmaˆ2$. The mean and variance of the distribution are functions of the unknown parameters $\\beta$ and $\\sigma$. Had we known these parameters we would have known the distribution of $Y_t$ fully, and so we could compute \n",
    "$$Pr(y_t<\\mu \\mid X_t, \\beta, \\sigma)=p_\\mu,\\ \\ \\mu\\in\\mathbb{R}.$$\n",
    "\n",
    "$Pr(y_t<\\mu \\mid X_t, \\beta, \\sigma)=p_\\mu$ is __the probability that we have observed the data $y_t$__ conditional on some value of the parameters ($\\beta, \\sigma$) and on $X_t$. The problem can be turned around, for what value of the parameters is the probabily that we observe $y_t$ maximized? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Likelihood function and Maximum Likelihood estimation\n",
    "\n",
    "The __likelihood function__ is defined as:\n",
    "$$ L(\\beta; y_t, X_t, \\sigma) = Pr(y_t \\mid X_t, \\beta, \\sigma).$$\n",
    "The __maximum likelihood estimator__ (MLE) of our parameters is the maximizer of $L(\\beta; y_t, X_t, \\sigma)$: $(\\hat\\beta,\\hat\\sigma)=arg\\,max_{\\beta,\\sigma}\\left(L(\\beta; y_t, X_t, \\sigma)\\right)$, though in practice we generally minimize the opposit of the logarithm of the the likelihood function, the __log likelihood function__ $-\\log L(\\beta; y_t, X_t, \\sigma)=-\\log\\left(L(\\beta)\\right)$. \n",
    "\n",
    "I have skipped entirely a lot of extremely important points in this very brief introduction to maximum likelihood, the goal was just to mention the main concepts. There are lots of excellent resourcres out there to read a proper discussion of these topics. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### MLE with Gluon\n",
    "To enable MLE with `gluon` we need to define the log likelihood loss function.\n",
    "Let us look at a concrete example. A random variable following a Bernoulli distribution takes value 0 with probability p, or 1 with probability 1-p. For a sample of $N$ independent Bernoulli distributed random variables $y_i$, the log likelihood is $L(p)=\\frac{1}{N}\\sum_i\\left(p y_i + (1-p)(1-y_i)\\right)$. \n",
    "\n",
    "Few likelihood functions are included in the current release of `gluon`, but defining a custom loss function is easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the Bernoulli loss function\n",
    "\n",
    "class Bernoulli(Loss):\n",
    "    \"\"\"Calculates the Bernoulli loss function\n",
    "    Output and label must have the same shape. This is a scalar loss function.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    weight : float or None\n",
    "        Global scalar weight for loss.\n",
    "    sample_weight : Symbol or None\n",
    "        Per sample weighting. Must be broadcastable to\n",
    "        the same shape as loss. For example, if loss has\n",
    "        shape (64, 10) and you want to weight each sample\n",
    "        in the batch, `sample_weight` should have shape (64, 1).\n",
    "    batch_axis : int, default 0\n",
    "        The axis that represents mini-batch.\n",
    "    \"\"\"\n",
    "    def __init__(self, weight=None, batch_axis=0, **kwargs):\n",
    "        super(Bernoulli, self).__init__(weight, batch_axis, **kwargs)\n",
    "\n",
    "    def hybrid_forward(self, F, output, label, sample_weight=None):\n",
    "        label = _reshape_label_as_output(F, output, label)\n",
    "        loss = -1 * (1 - output) * (1-label) - output * label\n",
    "        loss = _apply_weighting(F, loss, self._weight, sample_weight)\n",
    "        return F.mean(loss, axis=self._batch_axis, exclude=True)\n",
    "    \n",
    "bernoulli_log_lik = Bernoulli()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Gaussian MLE\n",
    "\n",
    "When we have a Gaussian linear model, it's quite easy to show that the values $\\hat\\beta$ and $\\hat\\sigma$, the MLEs, that minimize the log likelihood function are:\n",
    " \n",
    "  1. The value of $\\hat\\beta$ that minimimze $\\frac{1}{T}\\sum_t\\left( y_t-X_t\\beta\\right)ˆ2$, that is, __the minimizer of the L2 loss__. \n",
    "  2. $\\hat\\sigma^2 = \\frac{1}{T}\\sum_t \\epsilon_t^2 = \\frac{1}{T}\\sum_t (y_t - X_t\\hat\\beta)^2$, the same estimator we used earlier. \n",
    "\n",
    "In the Gaussian case the maximum likelihood estimator is equivalent to minimizing the L2 loss. This implies that to do Gaussian MLE we do not need to setup a particular function, we can just use L2 loss: `\n",
    "gaussian_log_lik = gluon.loss.L2Loss()`.\n",
    "\n",
    "One advantage of the likelihood framework is that we work with distributions, including in our forecasts. Another advantage is that we can use a wide class of probability distributions to describe the data.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generalized linear models\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "### Count forecasting\n",
    "\n",
    "To forecast counts we are going to use a discrete probability distribution that takes values ${0,1,2,...}$, the __Poisson distribution__. A random variable following a Poisson distribution with parameter $\\lambda_t>0$ takes values $y_t = 0,1,...$ with probability $Pr(y_t=\\mu) = \\frac{e^{-\\lambda_t}\\lambda_t^\\mu}{\\mu!}$. $\\lambda$ is both the mean and the variance of the Poisson distribution.\n",
    "\n",
    "Suppose that we would like to model the parameter $\\lambda_t$ of a Poisson distributed sample as a function of some explanatory variables $X_t$. We could posit $\\lambda_t=X_t\\beta$, with $X_t$ a column matrix of features and $\\beta$ a column vector of parameters. The specification $\\lambda_t=X_t\\beta$ makes it possible for the mean and variance of the process to be negative, we can't have that! Instead we assume that $\\lambda$ is a function of $X\\beta$: $\\lambda = f(X\\beta)$, $f(.)$ is called the link function.\n",
    "\n",
    "A common choice for the link function is to assume that the parameter of the Poisson distribution is log-linear, $\\log\\lambda_t = X_t\\beta$, implying that the link function is the exponential: $\\lambda = f(X\\beta)$. This choice can be problematic since we might end up taking the exponential of a large number leading to overflow. Instead we will use a logistic link function: $f(x) = \\log(1 + exp(x))$. \n",
    "\n",
    "With link function $f$, we have $f(E(y_t|X_t))=X_t\\beta$. The log-likelihood of some parameters $\\beta$ at time $t$ is:\n",
    "$$L(\\beta | X_t,y_t) =  y_t \\log(\\lambda_t(\\beta)) - \\lambda_t(\\beta)\\\\ = y_t \\log(f(X_t\\beta)) - f(X_t\\beta).$$\n",
    "\n",
    "In practice we will minimize $-L(\\beta | X_t,y_t) = y_t \\log(f(X_t\\beta)) - f(X_t\\beta)$. We define this log likelihood function below together with some helper functions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def logistic(x):\n",
    "    return nd.log1p(nd.exp(x,dtype='float64'))\n",
    "\n",
    "def inverse_logistic(x):\n",
    "    return nd.log(nd.exp(x,dtype='float64')-1)\n",
    "\n",
    "def _reshape_label_as_output(F, output, label):\n",
    "    return label.reshape(output.shape) if F is ndarray else label.reshape(())\n",
    "\n",
    "\n",
    "class Poisson(Loss):\n",
    "    def __init__(self, weight=None, batch_axis=0, **kwargs):\n",
    "        super(Poisson, self).__init__(weight, batch_axis, **kwargs)\n",
    "\n",
    "    def hybrid_forward(self, F, output, label, sample_weight=None):\n",
    "        label = _reshape_label_as_output(F, output, label)\n",
    "        loss = logistic(output) - F.log(logistic(output)) * label\n",
    "        return F.mean(loss, axis=self._batch_axis, exclude=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Weekly sales for a novelty item (p.37-38: Montgomery) \n",
    "# From datamarket.com \n",
    "    \n",
    "df = pd.read_csv(\"https://datahub.io/core/sea-level-rise/r/csiro_alt_gmsl_mo_2015.csv\").set_index('Time')\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ts = df.values[-50:,0]\n",
    "plt.plot(ts);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Features\n",
    "The data clearly appears to be trending upward, so the model will include a linear trend. There also might be some autocorrelation so the model will include a lag of the target variable. There does not appear to be any kind of seasonal pattern, so we won't include any seasonal features. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forecast_length = 10\n",
    "train_length = len(ts) - forecast_length\n",
    "\n",
    "tgt_min = ts.min()\n",
    "ts = ts - tgt_min\n",
    "\n",
    "# droping the first observation due to lag\n",
    "target = nd.array(ts[:train_length]).reshape((train_length,1))[1:]\n",
    "\n",
    "# prediction target\n",
    "pred_target = nd.array(ts[train_length:]).reshape((forecast_length,1))\n",
    "\n",
    "# construct lag and trend\n",
    "trend = nd.arange(train_length).reshape((train_length,1))\n",
    "lag_sales = nd.array(ts).reshape((train_length,1))\n",
    "\n",
    "# droping the last observation due to lag\n",
    "features = nd.concat(trend[:-1],lag_sales[:-1])\n",
    "\n",
    "# standardize\n",
    "features_mean = features.mean(axis=0)\n",
    "features_std = nd.array(features.asnumpy().std(axis=0)).reshape((1,1))\n",
    "features = (features - features_mean) / features_std\n",
    "\n",
    "print(features[:5,])\n",
    "print(target[:5,])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 5\n",
    "train_data = gluon.data.DataLoader(\n",
    "    gluon.data.ArrayDataset(features, target),\n",
    "    batch_size=batch_size, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Model\n",
    "\n",
    "The GLM is still just a linear model, and so we define the network as a single dense layer with a dimension 1 output. The only difference with the Gaussian linear models of the previous notebook is that we now assume the data follows a Poisson distribution and so use the Poisson log likelihood as our loss function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Context\n",
    "ctx = mx.cpu()\n",
    "\n",
    "# Network\n",
    "net = gluon.nn.Sequential()\n",
    "with net.name_scope():\n",
    "    net.add(gluon.nn.Dense(1))\n",
    "    \n",
    "# Parameter initialization    \n",
    "net.collect_params().initialize(mx.init.Normal(sigma=0.1), ctx=ctx)\n",
    "\n",
    "# Trainer\n",
    "trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})\n",
    "\n",
    "# Loss\n",
    "poisson_log_lik = Poisson()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "epochs = 50\n",
    "smoothing_constant = .05\n",
    "moving_loss = 0\n",
    "niter = 0\n",
    "loss_seq = []\n",
    "\n",
    "for e in range(epochs):\n",
    "    for i, (data, label) in enumerate(train_data):\n",
    "        data = data.as_in_context(ctx)\n",
    "        label = label.as_in_context(ctx)\n",
    "        with autograd.record():\n",
    "            output = net(data)\n",
    "            loss = poisson_log_lik(output, label)\n",
    "        loss.backward()\n",
    "        trainer.step(batch_size)\n",
    "        \n",
    "        ##########################\n",
    "        #  Keep a moving average of the losses\n",
    "        ##########################\n",
    "        niter +=1\n",
    "        curr_loss = nd.mean(loss).asscalar()\n",
    "        moving_loss = (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss\n",
    "\n",
    "        # correct the bias from the moving averages\n",
    "        est_loss = moving_loss/(1-(1-smoothing_constant)**niter)\n",
    "        loss_seq.append(est_loss)\n",
    "    if e % 10 ==0:\n",
    "        print(\"Epoch %s. Moving avg of log likelihood: %s\" % (e, est_loss)) \n",
    "        \n",
    "        params = net.collect_params() # this returns a ParameterDict\n",
    "\n",
    "print('The type of \"params\" is a ',type(params))\n",
    "\n",
    "# A ParameterDict is a dictionary of Parameter class objects\n",
    "# therefore, here is how we can read off the parameters from it.\n",
    "\n",
    "for param in params.values():\n",
    "    print(param.name,param.data())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the convergence of the estimated loss function\n",
    "%matplotlib inline\n",
    "\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(num=None,figsize=(8, 6),dpi=80, facecolor='w', edgecolor='k')\n",
    "plt.plot(range(niter),loss_seq, '.')\n",
    "\n",
    "# adding some additional bells and whistles to the plot\n",
    "plt.grid(True,which=\"both\")\n",
    "plt.xlabel('iteration',fontsize=14)\n",
    "plt.ylabel('est loss',fontsize=14)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Forecast\n",
    "\n",
    "As was the case with the Gaussian AR(1) of the previous notebook, the specification of our Poisson model is autoregressive, a feature is the previous value of the target. We are therefore going forecast recursively.\n",
    "\n",
    "\n",
    "The value our model predict is $X_t\\beta = f(\\lambda_t)$ where $f(.)$ is the link function which in this application is the logistic function. To recover the parameter $\\lambda$ of the Poisson distribution, which is what we are ultimately interested in estimating and predicting, we need to transform the output with $fˆ{-1}(.)$. That's what the `inverse_logistic` function is for.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def forecast_poisson(net, last_obs, features_mean, features_std,\n",
    "                     forecast_length, train_length):\n",
    "    forecast = nd.empty((forecast_length,1))\n",
    "    for t in range(forecast_length):\n",
    "        if t==0:\n",
    "            prev_obs = last_obs.reshape((1,1))\n",
    "        # construct features\n",
    "        trend = nd.array([train_length + t - 1]).reshape((1,1))\n",
    "        #fct_feat = trend\n",
    "        fct_feat = nd.concat(trend,prev_obs,dim=1)\n",
    "        # normalize features\n",
    "        fct_feat = (fct_feat - features_mean)/features_std\n",
    "        # forecast\n",
    "        fct = net(fct_feat)\n",
    "        forecast[t,] = fct[0][0]\n",
    "        prev_obs = inverse_logistic(fct)\n",
    "    return forecast\n",
    "  \n",
    "fit = net(features)    \n",
    "fit_trans = inverse_logistic(fit)\n",
    "\n",
    "fct = forecast_poisson(net,target[train_length-2],\n",
    "                            features_mean, features_std,\n",
    "                            forecast_length, train_length)\n",
    "fct_trans = inverse_logistic(fct)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_forecast(observed, fitted, forecasted):\n",
    "    plt.plot(fitted.asnumpy(), color=\"r\")\n",
    "    plt.plot(observed, color=\"g\")\n",
    "    T = len(fitted)\n",
    "    plt.plot(np.arange(T, T+len(forecasted)), forecasted.asnumpy(), color=\"b\")\n",
    "    plt.legend([\"Fitted\", \"Observed\", \"Forecasted\"]);\n",
    "    \n",
    "plot_forecast(ts[1:],fit,fct)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 90% prediction intervals\n",
    "\n",
    "Let's put a 90% prediction interval around our forecast. We proceed as previously, estimating the standard errors of the residuals and relying on the fact that the maximum likelihood estimates are asymptotically Gaussian to construct the intervals. There are several alternative ways to construct intervals relying on weaker or different assumptions, but this is a topic for another day. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "se_fit = (target - fit_trans).asnumpy().std()\n",
    "interval = np.concatenate((inverse_logistic(fct - 1.65*se_fit).asnumpy(),\n",
    "                           inverse_logistic(fct + 1.65*se_fit).asnumpy()), axis=1)\n",
    "\n",
    "def plot_forecast_interval(observed, fitted, forecasted, interval):\n",
    "    plt.plot(fitted.asnumpy(), color=\"r\")\n",
    "    T = len(fitted)\n",
    "    plt.plot(np.arange(T, T+len(forecasted)), forecasted.asnumpy(), color=\"b\")\n",
    "    plt.fill_between(np.arange(T, T+len(forecasted)), interval[:,0], interval[:,1], alpha=.3)\n",
    "    plt.plot(observed, color=\"g\")\n",
    "    plt.legend([\"Fitted\", \"Observed\", \"Forecasted\"]);\n",
    "    \n",
    "plot_forecast_interval(ts[1:], fit, fct, interval)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mean squared prediction error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "msfe_glm = nd.mean(nd.power(fct - pred_target,2))\n",
    "print(msfe_glm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Forecasting with a multilayer perceptron\n",
    "\n",
    "We're going to use a multilayer perceptron (MLP) as in chapter 3. The network is composed of three layers, 2 dense layers with 64 hidden units and a _relu_ activation layer, and a one-dimensional output layer. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_hidden = 64\n",
    "num_outputs = 1\n",
    "mlp_net = gluon.nn.Sequential()\n",
    "with mlp_net.name_scope():\n",
    "    mlp_net.add(gluon.nn.Dense(num_hidden, activation=\"relu\"))\n",
    "    mlp_net.add(gluon.nn.Dense(num_hidden, activation=\"relu\"))\n",
    "    mlp_net.add(gluon.nn.Dense(num_outputs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Initialization\n",
    "\n",
    "Here we initialize the parameters using the Xavier algorithm, set up the trainer, and define the loss function. Since we are modeling and predicting continuous variables, I use a square Loss. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# context\n",
    "mlp_ctx = mx.cpu()\n",
    "# Parameters\n",
    "mlp_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=mlp_ctx)\n",
    "# trainer\n",
    "trainer = gluon.Trainer(mlp_net.collect_params(), 'sgd', {'learning_rate': .1})\n",
    "# likelihood\n",
    "poisson_ll = Poisson()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 5\n",
    "train_data = gluon.data.DataLoader(\n",
    "    gluon.data.ArrayDataset(features, target),\n",
    "    batch_size=batch_size, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "epochs = 100\n",
    "smoothing_constant = .05\n",
    "\n",
    "for e in range(epochs):\n",
    "    cumulative_loss = 0\n",
    "    for i, (data, label) in enumerate(train_data):\n",
    "        data = data.as_in_context(mlp_ctx)\n",
    "        label = label.as_in_context(mlp_ctx)\n",
    "        with autograd.record():\n",
    "            output = mlp_net(data)\n",
    "            #print(output)\n",
    "            #loss = square_loss(output, label)\n",
    "            loss = poisson_ll(output, label)\n",
    "            loss.backward()\n",
    "        trainer.step(data.shape[0])\n",
    "        cumulative_loss += nd.sum(loss).asscalar()\n",
    "    if e % 10 ==0:\n",
    "        print(\"Epoch %s. Loss: %s\" % (e, cumulative_loss))\n",
    "    \n",
    "for param in params.values():\n",
    "    print(param.name,param.data())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Forecasts and prediction intervals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fit and residual standard errors\n",
    "fit = mlp_net(features)\n",
    "se_fit_mlp = (target - fit).asnumpy().std()\n",
    "\n",
    "# forecast\n",
    "fct = forecast_poisson(mlp_net,target[train_length-2],\n",
    "                            features_mean, features_std,\n",
    "                            forecast_length, train_length) \n",
    "# prediction interval\n",
    "interval = np.concatenate((inverse_logistic(fct - 1.65*se_fit).asnumpy(),\n",
    "                           inverse_logistic(fct + 1.65*se_fit).asnumpy()), axis=1)\n",
    "\n",
    "plot_forecast_interval(ts[1:],inverse_logistic(fit),inverse_logistic(fct),interval)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "msfe_mlp = nd.mean(nd.power(fct - pred_target,2))\n",
    "\n",
    "print(\"Mean squared forecast error of the GLM: %s, of the MLP: %s\" % (msfe_glm, msfe_mlp))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
